Abstract

We show that it is impossible to "boost" the level of fault-tolerance of a system solving 
consensus by combining less fault-tolerant components into a more fault-tolerant system. To do 
this, we consider an asynchronous distributed computing model in which a known set of processes 
interact in two ways: by using reliable point-to-point channels, and by accessing shared services. 
Each of the shared services is connected to a subset of all the processes. 

Our boosting impossibility result is: for any f ≥ 1, the consensus problem is unsolvable in
this model in the presence of up to f process stopping failures, if each of the shared services
is assumed to tolerate only f - 1 process failures. This result holds regardless of the types of
the shared services and the pattern of connectivity of processes and services. In particular, it
is impossible to construct a protocol to solve the consensus problem for f process failures using
any number of consensus services that tolerate f - 1 process failures.

Interestingly, it is possible to boost the fault-tolerance of a system solving problems easier than
consensus. For example, we show that the k-consensus problem is solvable for 2k - 1 failures using
only (consensus) services that tolerate only 1 failure apiece.



1 Introduction 

It is generally accepted that large distributed systems should be constructed from building blocks
(such as middleware-provided services) that interact with each other through well-defined
interfaces. Large systems must also tolerate a variety of types of failures. Establishing fault-tolerance
properties of a large system is difficult, as many scenarios have to be considered. A particularly
desirable approach is to "boost" the level of fault-tolerance by combining less fault-tolerant
components into a more fault-tolerant system. It is plausible that this might be achieved using techniques
such as quorums, replication, and redundancy.

In this paper, we demonstrate a fundamental limitation on this approach. Namely, we
investigate the possibility of fault-tolerance boosting for implementing a consensus service tolerant to
f stopping failures from underlying "subservices" that are tolerant to f - 1 stopping failures. We
show that, in the setting of purely asynchronous message passing, such fault-tolerance boosting
cannot be achieved, for any type of underlying services. That is, the availability of any set of
distributed services, each of which tolerates up to f - 1 stopping failures, is insufficient to construct
a consensus protocol that tolerates f failures.

In more detail, we consider a set of asynchronous processes of which f can fail by stopping,
communicating with each other by sending messages through reliable point-to-point channels. In
addition, there is a set of services through which they can communicate implicitly. A process can
invoke operations of a service by sending a message to one of its ports, and eventually get a response
from the service. A process can invoke multiple operations on a service, and concurrently on other
services. But before issuing a new operation on the same service, it must first wait for a response
to the current invocation. Each service has a fixed set of "ports" and each port is hardwired
to one process, where it receives invocations and returns responses to the corresponding process.
Each service has some degree of fault tolerance, say f, which is the number of failures of (hardwired)
processes accessing it that the service can tolerate without itself failing. This is intended to reflect
the idea that services are implemented by distributed algorithms, which run at a number of locations,
represented by ports. The failure really affects the location, causing not only the failure of the process
hardwired to the corresponding port, but also the failure of that part of the distributed implementation of
the service which resides at that location. If a sufficient number (> f) of locations of a distributed
implementation fail, then the implementation itself will fail. Note that this idea does not in any
way prevent the use of arbitrary oracles in the implementation of a service, such as failure
detectors or powerful hardware concurrent objects.

Notice that, except for the failure behavior, our services are just like the linearizable typed
shared objects usually considered in the literature, e.g., [Her91, CJT94, Jay97, LH00]. The services
usually considered in the literature do not fail at all. There are only two papers we are aware of
that consider services that can fail, [JCT98] and [AGMT95], but these papers assume the services
are not implemented by the processes. In contrast to our model, the failures of the services and of
the processes are not correlated in those two papers. We discuss this further in the Related Work
section below.

Our impossibility result says that it is impossible to build a consensus service tolerating f failures
from services that tolerate fewer than f failures, independently of the number of such services, how
powerful they are, or in what way they are accessed by the processes. Thus, for example, a strategy
in which multiple instances of (f - 1)-fault-tolerant services are used by different subsets of the
processes in the system cannot work. Methods based on splitting up processes, or divide and
conquer, also cannot work. In particular, our result holds when the underlying services include
consensus services tolerant to f - 1 stopping failures.

It is important to study consensus implementability because it is such a fundamental problem 
in distributed computing. In particular, there is Herlihy's [Her91] universality result for services 
that do not fail: it is possible to design a wait-free implementation of a service of any type, shared 
by n processes, using only consensus services with n ports and registers. Our boosting impossibility 
result shows a limitation on this universality result when services can fail. 

Our impossibility holds for consensus implementability, but not for implementability of weaker
problems. Our second result is that it is possible to boost the fault-tolerance of a system solving problems
easier than consensus, like k-consensus. In this problem processes have to agree on at most k
different values; thus, k-consensus reduces to consensus when k = 1. We present a simple algorithm
(generalizing the one in [HR94, HR00]) that solves k-consensus and tolerates f failures using
k'-consensus services that tolerate f' < f failures, for various values of k' and f'. For example,
k-consensus is solvable for 2k - 1 failures using only (consensus) services that tolerate only 1 failure
apiece.

Related work. Our main result is the impossibility of solving consensus f-resiliently using
(f - 1)-resilient services in an asynchronous system. Much prior work has studied the feasibility
of implementing f-tolerant consensus as a function of the available components in the asynchronous
system. The "components" can be simple message transmission channels or shared read/write
registers, but also more powerful objects, perhaps implemented in hardware such as test&set or
implemented with timeouts such as failure detectors, or even combinations of different kinds of
objects. A typed shared object used in many papers is what we call a service, i.e., it has (i) a
number of ports; (ii) a set of states of the object (or values as we call them); (iii) the set of
operations that processes may apply through its ports; (iv) the behavior of the object in terms of
a transition relation δ, and is assumed to be linearizable. The usual assumption, however, is that
the components themselves are reliable.

The work that assumes the most basic components is [FLP85], for just
message transmission, and [LAA87, Her91], for shared read/write registers; these papers prove that it is
impossible to solve f-tolerant consensus using only these simple components. In these works the available
components, whether channels or registers, never fail. Since a consensus protocol that tolerates zero
crash faults is trivial, our result generalizes that of [FLP85], which is the special case f = 1.
Indeed, our proof technique is a generalization of the one in [FLP85]. The main difference is the
idea of modelling the services. This introduces many more scenarios to deal with in the proof. Also,
our events are much finer grained: in FLP, in one event a process receives a message, makes a local
state change, and also sends any finite number of messages. Our events are I/O automata actions
in the model of distributed systems with services. So, for example, a process receiving a message
can only make a local state change; it cannot perform any output of any kind in the same event.

Other papers consider more general and powerful base objects (again that never fail), and
investigate when they can be used to solve consensus. For example, [LH00] asks the following question for
f = 1: Let n > 3 and S be a set of object types that can be used to solve one-resilient consensus
among n processes. Can S always be used to solve one-resilient consensus among n - 1 processes?
Many papers consider the other extreme, f = n - 1, and deal with the robustness question
posed in [Jay97]: can objects of types T and T', neither of which can be used on its own to solve
wait-free consensus, be combined so that together they solve wait-free consensus?

Other papers relate implementations for different numbers of processes based on the same
fault-tolerance level f. Specifically, [CJT94] shows that for all n > f > 2 and all sets S of shared object types
(that include simple read/write registers) there is an f-resilient solution to n-process consensus using
objects of types in S if and only if there is an f-resilient solution to (f + 1)-process consensus using
objects of types in S. Similarly, [BGLR01] shows for k-set consensus: if there is an f-resilient implementation of
n-ported f-set consensus from registers then there is an f-resilient implementation of (f + 1)-ported
f-set consensus from registers.

Thus, our question is orthogonal to the concerns of these previous works: while they assume
reliable components, we consider components that are less reliable, i.e., we ask what problems can
be solved in an f-resilient manner using components that tolerate fewer than f failures. We know
of two papers that do consider shared objects that may fail. Afek, Greenberg, Merritt, and Taubenfeld
[AGMT95] study wait-free implementations using objects that can fail by returning the wrong value
for a response. More closely related to our work is [JCT98], which considers base objects that may
fail by not responding (both [JCT98] and [AGMT95] consider other types of failures, like wrong
values returned, that are less related to our work). In their model any number of processes may fail, and at
most t base objects may fail. When an object fails, it stops responding. They have an impossibility
result for solving consensus for two processes tolerating even one nonresponsive-faulty service,
even if that service can be nonresponsive with respect to only one predetermined process. This proof works
by a reduction from [LAA87]. This result is orthogonal to ours: the failures of the services in their
model are unrelated to the failures of the processes, while in our model, services can fail only due
to failures of processes. Thus, if no process fails, in our model we know no service will fail, while
in their model services could still fail in such a situation. On the other hand, they know that at
most one service will fail, while in ours there is no bound: if one service fails due to too many
processes failing, all the services connected to the same processes can also fail.

Our main concern in this paper is the implementation of consensus. Recall that Herlihy
[Her91] has shown that any object can be implemented using consensus. Thus consensus is at the
top of a hierarchy. As mentioned above, our impossibility result does not hold for objects weaker
than consensus.

The paper is organized as follows. Section 2 gives technical preliminaries. Section 3 gives our
model of a distributed system, and defines the consensus problem. Section 4 presents our
impossibility result for consensus. Section 5 describes the contrasting result for k-set consensus. Section 6
discusses directions for further research and concludes. Appendix A presents some technical
background.

2 Modeling Preliminaries 

2.1 Basic underlying model of concurrent computation 

We use the I/O automaton model [Lyn96, chapter 8] as our underlying model for concurrent
computation. We assume the terminology of [Lyn96, chapter 8]. An I/O automaton A is deterministic
iff, for each task t of A, and each state s of A, there is at most one transition (s, a, s') such that
a ∈ t.

2.2 Variable types 

We define the notion of a "variable type", in order to describe allowable sequential behavior of
services. The definition used here is a generalization of the one in [Lyn96, chapter 9]; the
generalization allows nondeterminism in the choice of the initial state and the next state. Namely, a
variable type T = (V, V_0, invs, resps, δ) consists of:

• V, a nonempty set of states of the variable, called values,

• V_0 ⊆ V, a nonempty set of initial values,

• invs, a set of invocations,

• resps, a set of responses, and

• δ, a subset of (invs × V) × (resps × V) that is "total", in the sense that, for every (a, v) ∈
invs × V, there is at least one (b, v') ∈ resps × V such that ((a, v), (b, v')) ∈ δ.

A deterministic variable type is one in which δ is a mapping, i.e., for every (a, v) ∈ invs × V,
there is exactly one (b, v') ∈ resps × V such that ((a, v), (b, v')) ∈ δ.

The reason for generalizing the notion of a variable type to allow nondeterminism is that we
want to make our notion of "service", defined below, as general as possible. In particular, we want
to include the problem of k-consensus, which can be specified using a nondeterministic variable
type, in our consideration.

Example. Read/write variable type: Here, V is some arbitrary set of "values," V_0 = V,
invs = {read} ∪ {write(v) : v ∈ V}, resps = V ∪ {ack}, and δ is defined to include the following
pairs: ((read, v), (v, v)) for v ∈ V, and ((write(v), v'), (ack, v)) for v, v' ∈ V. □

Example. Consensus variable type: Here, V is the set of subsets of {0, 1} having at most one
element, V_0 = {∅}, invs = {init(v) : v ∈ {0, 1}}, resps = {decide(v) : v ∈ {0, 1}}, and δ is defined to
include the following pairs:

((init(v), ∅), (decide(v), {v})) for v ∈ {0, 1}, and ((init(v), {v'}), (decide(v'), {v'})) for v, v' ∈ {0, 1}. □

Example. k-consensus variable type: Here, V is the set of subsets of {0, 1, ..., k} having
at most k elements, V_0 = {∅}, invs = {init(v) : v ∈ {0, 1, ..., k}}, resps = {decide(v) : v ∈ {0, 1, ..., k}},
and δ is defined to include the following pairs: ((init(v), W), (decide(v'), W ∪ {v})) for |W| < k,
v' ∈ W ∪ {v}, and ((init(v), W), (decide(v'), W)) for |W| = k, v' ∈ W.
Thus, the first k values get remembered, and all operations return one of these first k values. □
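
The three example variable types can be made concrete in a few lines. The following is a minimal Python sketch and not part of the formal development: each δ is modeled as a function from an (invocation, value) pair to the set of allowed (response, new value) pairs, and the names read_write_delta, k_consensus_delta, and is_total are hypothetical. Under this encoding, the consensus type is the case k = 1.

```python
from itertools import product

def read_write_delta(inv, val):
    """Read/write type: allowed (response, new value) pairs for (inv, val)."""
    if inv == ("read",):
        return {(val, val)}            # ((read, v), (v, v))
    _, v = inv                         # inv is ("write", v)
    return {("ack", v)}                # ((write(v), v'), (ack, v))

def k_consensus_delta(k):
    """k-consensus type; values are frozensets W holding at most k proposals."""
    def delta(inv, W):
        _, v = inv                     # inv is ("init", v)
        new_W = W | {v} if len(W) < k else W
        # Any remembered value may be returned: this is the nondeterminism.
        return {(("decide", w), new_W) for w in new_W}
    return delta

def is_total(delta, invs, values):
    """delta is total if every (invocation, value) has at least one outcome."""
    return all(delta(a, v) for a, v in product(invs, values))
```

For k = 1 every call returns a single outcome, so the type is deterministic; for k ≥ 2 a call may return several outcomes, which is why the generalized (nondeterministic) definition is needed.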

2.3 Canonical f-fault-tolerant atomic objects

We now define the notion of canonical f-fault-tolerant atomic object, which describes the allowable
concurrent behavior of services. The canonical f-fault-tolerant atomic object of type T for endpoint
set J and with index k is given in Figure 1 as an I/O automaton that is parameterized by k, T, J,
and f, where these are:

1. A unique index k, drawn from some index set K,

2. An underlying variable type T = (V, V_0, invs, resps, δ), which defines the sequential behavior
of the object,

3. A set of "endpoints" J, and

4. The required degree of fault-tolerance f.



A canonical atomic object accommodates concurrent invocations by different processes, i.e., 
between an invocation from and response to a particular process, the invocations of other processes 
may arrive and be processed. The use of a set of endpoints allows different services to be connected 
to different sets of processes. Thus, J will be a subset of some set I of process indices, which
represents all the processes in the system. 

Our notion of atomic object generalizes that in [Lyn96, section 13.1.2]. We note the
following features of our atomic objects. Each process in J can issue any invocation of the atomic
object's underlying variable type, and can (potentially) receive any allowable response. The
result of performing a particular operation is nondeterministically selected from all results allowed
by the transition relation δ and the current value val of the object. Thus, the object is, in
general, inherently nondeterministic in that it can exhibit nondeterminism that is not just due to the
nondeterminism of its invocations by different processes.

For every process P_i, i ∈ J, there corresponds a task of the atomic object, which we call an
i-task. The i-task consists of all the perform actions that carry out the operations invoked by
P_i, together with all the possible response actions giving responses to P_i. In addition, the i-task
contains a dummy_{k,i} action, which is enabled when either P_i has failed or more than f processes
in J have failed. Thus, by inspecting Figure 1 we see that for every i ∈ J, the task structure
requires that the object eventually respond to an outstanding invocation by P_i, unless either P_i
has failed or more than f processes in J have failed. In either of these cases, the object is allowed to
abstain from responding to P_i, since the internal action dummy_{k,i} is enabled, and can be executed
to discharge the fairness requirement imposed by the task structure. If more than f processes have
failed, then the object is allowed to abstain from responding to any process in J, since dummy_{k,i} is
enabled for all i ∈ J. This reflects the idea that the object is f-tolerant; once more than f failures
have occurred (amongst processes connected to the object), then the object can itself "fail" by
being "silent" forever from that point onwards. That is, we allow the object to violate its liveness
property. Note, however, that the object can never violate its safety property, e.g., by returning
values inconsistent with the transition relation δ. Note that we also allow the object to be silent if
all processes it is connected to (i.e., in J) fail, since dummy_{k,i} is then enabled for all i ∈ J.

2.4 f-fault-tolerant atomic objects

Given a variable type T_k and set J_k of endpoints, define an I/O automaton U to be a well-formed
environment for T_k and J_k if and only if

1. Its outputs are exactly the invocations of T_k at the endpoints in J_k, and its inputs are exactly
the responses of T_k at the endpoints in J_k, and

2. In every execution of U, for each endpoint i ∈ J_k, there are no two consecutive invocations at
i without an intervening response at i.
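
Condition 2 is easy to check mechanically on a finite sequence of events. The sketch below is illustrative only: the event encoding (pairs of a kind and an endpoint) and the name is_well_formed are assumptions made for this example, not notation from the model.

```python
def is_well_formed(events):
    """Return True iff no endpoint issues two invocations without an
    intervening response. `events` is a list of (kind, endpoint) pairs,
    with kind either "inv" or "resp"."""
    pending = set()                    # endpoints with an unanswered invocation
    for kind, endpoint in events:
        if kind == "inv":
            if endpoint in pending:
                return False           # second invocation before a response
            pending.add(endpoint)
        elif kind == "resp":
            pending.discard(endpoint)
    return True
```

Note that invocations at different endpoints may interleave freely; only repeated invocations at the same endpoint are ruled out.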

An I/O automaton A (a full-blown I/O automaton, with tasks) is said to be an f-fault-tolerant
atomic object of type T_k, set J_k of endpoints, and index k, if and only if it implements the
f-fault-tolerant canonical atomic object S_k of type T_k for J_k, in the following sense:

1. It has the same input and output actions (including the fail actions). 

2. If U is a well-formed environment for T_k and J_k, then

(a) Any trace β of A × U is also a trace of S_k × U. (This should imply that A preserves
well-formedness and guarantees atomicity.)

(b) Any fair trace β of A × U is also a fair trace of S_k × U. (This should imply that the
implementation is f-fault-tolerant.)

Canonical Atomic-Object(k, (V, V_0, invs, resps, δ), J, f)

Signature

    Input:
        a_{i,k}, a ∈ invs, the invocations of Atomic-Object(k, (V, V_0, invs, resps, δ), J, f) by P_i, i ∈ J
        fail_i, i ∈ J
    Output:
        b_{k,i}, b ∈ resps, the responses of Atomic-Object(k, (V, V_0, invs, resps, δ), J, f) to P_i, i ∈ J
    Internal:
        perform((a, v), (b, v'))_{k,i}, a ∈ invs, b ∈ resps, v, v' ∈ V, i ∈ J
        dummy_{k,i}, i ∈ J

State

    val, a value in V, initially a value in V_0
    inv-buffer, a set of pairs (i, a), for a_i an input action
    resp-buffer, a set of pairs (i, b), for b_i an output action
    failed ⊆ J, initially empty

Actions

    Input a_{i,k}
        Eff: inv-buffer ← inv-buffer ∪ {(i, a)}

    Internal perform((a, v), (b, v'))_{k,i}
        Pre: (i, a) ∈ inv-buffer ∧ val = v ∧ δ((a, v), (b, v'))
        Eff: inv-buffer ← inv-buffer - {(i, a)};
             val ← v';
             resp-buffer ← resp-buffer ∪ {(i, b)}

    Output b_{k,i}
        Pre: (i, b) ∈ resp-buffer
        Eff: resp-buffer ← resp-buffer - {(i, b)}

    Input fail_i
        Eff: failed ← failed ∪ {i}

    Internal dummy_{k,i}
        Pre: i ∈ failed ∨ |failed| > f
        Eff: none

Tasks

    For every i ∈ J: {perform((a, v), (b, v'))_{k,i} : δ((a, v), (b, v'))} ∪ {b_{k,i} : b ∈ resps} ∪ {dummy_{k,i}}

Figure 1: I/O automaton for the canonical f-fault-tolerant atomic object with endpoints J and
type T = (V, V_0, invs, resps, δ)
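
To make the canonical automaton more tangible, here is a minimal executable sketch in Python. It is an illustration under stated assumptions rather than a faithful I/O automaton: scheduling is left to the caller, nondeterministic choices in δ are resolved by a fixed (arbitrary) tie-break, and the names CanonicalAtomicObject, consensus_delta, and dummy_enabled are hypothetical.

```python
def consensus_delta(inv, val):
    """Consensus variable type: the first proposed value is remembered."""
    _, v = inv                                   # inv is ("init", v)
    new_val = val if val else frozenset({v})
    (w,) = new_val                               # the single remembered value
    return {(("decide", w), new_val)}

class CanonicalAtomicObject:
    def __init__(self, delta, initial, endpoints, f):
        self.delta, self.val, self.f = delta, initial, f
        self.endpoints = set(endpoints)          # the set J
        self.inv_buffer = set()                  # pairs (i, a)
        self.resp_buffer = set()                 # pairs (i, b)
        self.failed = set()

    def invoke(self, i, a):                      # input a_{i,k}
        self.inv_buffer.add((i, a))

    def fail(self, i):                           # input fail_i
        self.failed.add(i)

    def perform(self, i):                        # internal perform_{k,i}
        pending = sorted((p for p in self.inv_buffer if p[0] == i), key=repr)
        if not pending:
            return False                         # precondition not satisfied
        _, a = pending[0]
        # Assumption: pick one allowed outcome by an arbitrary fixed tie-break.
        b, new_val = sorted(self.delta(a, self.val), key=repr)[0]
        self.inv_buffer.discard((i, a))
        self.val = new_val
        self.resp_buffer.add((i, b))
        return True

    def respond(self, i):                        # output b_{k,i}
        ready = sorted((p for p in self.resp_buffer if p[0] == i), key=repr)
        if not ready:
            return None
        self.resp_buffer.discard(ready[0])
        return ready[0][1]

    def dummy_enabled(self, i):                  # precondition of dummy_{k,i}
        return i in self.failed or len(self.failed) > self.f
```

In this sketch, dummy_enabled(i) returning True corresponds to the situations in which the object is permitted to stay silent toward endpoint i; the safety-related state (val and the buffers) is never affected by failures.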

3 Model of Computation 

The model we consider for our problem consists of a collection of processes, channels, and services, 
which we define formally below. For the rest of this section, we fix: 

• I, K, finite index sets,

• T, a variable type for the entire system, representing the problem being solved, and

• M, a message alphabet.

Figure 2: The interfaces of process P_i, channels C_{i,j}, C_{j,i} and service S_k in the complete system.

A distributed system with services (DSS) for I, K, T, M is the parallel composition of I/O automata
(see [Lyn96, chapter 8]) of the following kinds:

1. processes P_i, i ∈ I, and

2. channels C_{i,j}, i, j ∈ I, i ≠ j, and

3. services S_k, k ∈ K. We let T_k denote the variable type and J_k ⊆ I denote the set of endpoints
of service S_k.

Processes interact with one another via channels: process P_i communicates with process P_j over the
unidirectional channel C_{i,j}. Processes also interact with services: process P_i can invoke service S_k
provided that i is in S_k's set of endpoints. Services do not communicate directly with one another;
however, they interact indirectly via common processes. Figure 2 shows the interfaces that a process, channel,
and service have. In the remainder of this section, we provide more details about the components.



3.1 Processes 

Process P_i, i ∈ I, has the following kinds of inputs and outputs:

1. Inputs a_i and outputs b_i, where a is an invocation of type T and b is a response of type T.
These represent P_i's interactions with its own clients (the outside world).

2. Outputs send(m)_{i,j} and inputs receive(m)_{j,i}, m ∈ M, which connect to channels C_{i,j} and
C_{j,i}, respectively.

3. For every service S_k such that i ∈ J_k, outputs a_{i,k}, where a is an invocation of type T_k, and
inputs b_{k,i}, where b is a response of type T_k.

4. Input fail_i.

We assume that P_i observes well-formedness for each separate service S_k: it does not issue two
invocations on S_k without receiving a response to the first one. However, P_i is allowed to issue an
invocation on a service without waiting for previous invocations on other services to respond. That
is, P_i can issue concurrent invocations to different services, but not to the same service. We also
assume that the client of P_i is well-formed with respect to P_i: it does not issue two invocations to
P_i without receiving a response to the first one. We assume that P_i has only a single task, which
therefore consists of all the locally-controlled actions of P_i. We assume that in every state, some
action in that single task is enabled. We assume that the fail_i input action sends P_i into a
state from which (from that point onward) no output actions are enabled. However, other
locally-controlled actions may be enabled; in fact, by the restriction just above, some such action
must be enabled. This action might be a "dummy" action, as in the fault-tolerant atomic objects
defined earlier.

3.2 Services 

We define an f-fault-tolerant service of a particular variable type T_k for a particular set J_k of
endpoints to be simply the canonical f-fault-tolerant atomic object of type T_k for J_k. Let T_k.invs,
T_k.resps denote the set of invocations, responses, respectively, of the variable type T_k.

The safety properties of a service S_k are determined by its finite traces, which are determined
by its start states, transitions, and signature. These are all part of the definition of the service as an
I/O automaton. Likewise, the liveness properties of a service S_k are determined by the automaton
task structure and the usual conventions for fair executions of I/O automata.

We say that P_i has an outstanding invocation to a service S_k iff either (1) the invocation buffer
of S_k contains an invocation of the form (i, a), a ∈ T_k.invs, or (2) the response buffer of S_k contains
a response of the form (i, b), b ∈ T_k.resps.

We say that a service S_k is silent along an execution α iff the only actions that S_k executes
along α are dummy actions.

3.3 Channels 

Channel C_{i,j} is a FIFO reliable channel, as defined in [Lyn96, chapter 14]. Its inputs are send(m)_{i,j}
actions, which are outputs of P_i, and its outputs are receive(m)_{i,j} actions, which are inputs of P_j.
A channel has exactly one task, consisting of its locally controlled actions.

3.4 The task structure of a complete system 

The ordinary assumptions about I/O automata mean that the system executes using a "weakly
fair" scheduling discipline: in any execution, every task that is continuously enabled gets selected
for execution infinitely often. (Thus, an enabled task is eventually either disabled or executed.)
For a service S_k, there is a task for each i ∈ J_k, consisting of the actions {perform((a, v), (b, v'))_{k,i} :
δ((a, v), (b, v'))} ∪ {b_{k,i} : b ∈ resps} ∪ {dummy_{k,i}}; see Figure 1. For a process P_i there is a single task,
consisting of all the locally controlled actions of P_i. Likewise, for a channel C_{i,j}, there is a single
task, consisting of all the locally controlled actions of C_{i,j}, i.e., the receive(m)_{i,j} actions, m ∈ M.



Since a task of a component contains only its locally controlled actions, we infer from the
signature compatibility condition for I/O automata that the tasks define a partition of the set of
all actions in the system, except the init(v)_i and fail_i actions; each action occurs in exactly one
task.

With this task structure, the weak fairness discipline implies that every message that is sent 
is eventually received, every process executes infinitely often along an infinite fair execution, and 
every outstanding invocation (of a service) eventually receives a response. 

We introduce a naming scheme for tasks as follows. The single task of P_i, i ∈ I, is called pt_i. The
single task of channel C_{i,j}, i, j ∈ I, i ≠ j, is called ct_{i,j}. The task of service S_k, k ∈ K, for i ∈ J_k is
called st_{k,i}. We define PT = {pt_i : i ∈ I}, CT = {ct_{i,j} : i, j ∈ I, i ≠ j}, ST = {st_{k,i} : k ∈ K, i ∈ J_k},
and T = PT ∪ CT ∪ ST. We call the tasks pt_i (i ∈ I) process tasks, the tasks ct_{i,j} (i, j ∈ I, i ≠ j)
channel tasks, and the tasks st_{k,i} (k ∈ K, i ∈ J_k) service tasks.

For any action a except an init(v)_i or fail_i, we define task(a) to be the unique t such that t ∈ T
and a ∈ t, i.e., task(a) is the name of the task containing a. We define task(init(v)_i) = init(v)_i,
and task(fail_i) = fail_i, i.e., we consider these actions as being the sole members of singleton tasks,
and overload the name of the action as the name of the corresponding task. If e is a channel task
ct_{i,j}, then let receiver(e) be the process P_j.
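
As a small illustration of the naming scheme, the sets PT, CT, and ST can be enumerated directly from I and the endpoint sets J_k. The Python sketch below is hypothetical: tasks are encoded as tagged tuples, and the function names task_names and receiver are chosen for this example only.

```python
def task_names(I, J):
    """I: iterable of process indices; J: dict mapping each service index k
    to its endpoint set J_k. Returns the three families of task names."""
    I = list(I)
    PT = {("pt", i) for i in I}                               # process tasks
    CT = {("ct", i, j) for i in I for j in I if i != j}       # channel tasks
    ST = {("st", k, i) for k, J_k in J.items() for i in J_k}  # service tasks
    return PT, CT, ST

def receiver(task):
    """For a channel task ct_{i,j}, return the index j of the receiving process."""
    kind, _, j = task
    assert kind == "ct"
    return j
```

With three processes and two services of two endpoints each, this yields 3 process tasks, 6 channel tasks (one per ordered pair of distinct processes), and 4 service tasks.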

3.5 The Consensus problem 

The "traditional" specification of f-fault-tolerant consensus is given in terms of a set {P_i, i ∈ I}
(I is an index set) of processes that each starts with some value v_i drawn from {0, 1}. Processes
are subject to crash failures [Sch90], which disable the process from producing any output.¹ As a
result of engaging in a consensus algorithm, each nonfaulty process eventually "decides" on a value
from {0, 1}. The behavior of processes is required to satisfy the following three conditions [Lyn96,
chapter 6]:

Agreement No two processes decide on different values. 

Validity The value decided on is the initial value of some process. 

Termination In every infinite fair execution, all nonfaulty processes eventually decide. 
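
The three conditions can be phrased as simple predicates over the outcome of a finished run. The following Python sketch is an illustration only: it assumes a run is summarized by two dictionaries (initial values and decisions, both keyed by process index), and the function names are hypothetical. Termination is a property of infinite fair executions, so the last predicate is merely a finite proxy that checks whether every nonfaulty process has decided by the end of the recorded run.

```python
def check_agreement(decisions):
    """No two processes decide on different values."""
    return len(set(decisions.values())) <= 1

def check_validity(initial_values, decisions):
    """Every decided value is the initial value of some process."""
    proposed = set(initial_values.values())
    return all(d in proposed for d in decisions.values())

def check_termination(nonfaulty, decisions):
    """Finite proxy: every nonfaulty process appears among the deciders."""
    return all(i in decisions for i in nonfaulty)
```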

We specify the consensus problem in a slightly different way. We say that a DSS S solves
f-fault-tolerant consensus for I if and only if S is an f-fault-tolerant atomic object of type consensus
(Section 2.2) for endpoint set I.

We now show that any system that meets our definition also meets the traditional one. We
argue that the f-fault-tolerant canonical consensus object for endpoint set I satisfies the three
conditions above (with a slight variation of the termination condition).

From the definition of the consensus variable type, each process in I has two invocations, init(0),
init(1), and two responses, decide(0), decide(1). By inspecting the consensus variable type given in
Section 2.2, we see that the value of the variable is initially ∅, and on invocation init(0) can change
from ∅ to {0}, and on invocation init(1) can change from ∅ to {1}, and is stable once it is different
from ∅. It is also clear that any decide(0) response is only issued by the object when the variable
has value {0}, and any decide(1) response is only issued by the object when the variable has value
{1}. Hence, after the first decide(0) response, all subsequent responses will be decide(0), and after
the first decide(1) response, all subsequent responses will be decide(1). So, the canonical consensus
object satisfies the agreement condition. If all invocations are init(0), then the only possible change
of the variable is from ∅ to {0}. Hence, all responses will be decide(0). Likewise if all invocations
are init(1), then all responses will be decide(1). Otherwise, there are both init(0) and init(1)
invocations, so either decision value occurs in some invocation. Hence, in all cases, the value decided
on is the value occurring in some invocation.
Hence, the canonical consensus object satisfies the validity condition. If at least one process invokes
the f-fault-tolerant canonical consensus object, then the value of the variable will eventually be
either {0} or {1}, provided that no more than f processes fail, and that the scheduling is weakly fair, as
discussed in Section 3.4. Hence, all nonfaulty processes that invoke the object will receive a decide
response, along fair executions in which no more than f processes fail. Processes that do not invoke
the object will not receive a response, even if they are nonfaulty. That is, processes that do not
invoke the object (with an init(v) action) do not participate in the consensus algorithm, and hence
are not required to have an initial value. This is a slightly different condition than the traditional
termination condition, which requires that all nonfaulty processes do have an initial value, and that
they all eventually decide. Here, only the nonfaulty processes that "participate," by invoking the
object, will receive a decision.

¹ Crash failures are usually defined as disabling the process from executing at all. However, the two definitions are
equivalent with respect to overall system behavior.

Since any system S that solves f-fault-tolerant consensus for I can only exhibit behaviors
(in composition with a well-formed environment) that are a subset of the behaviors of the
f-fault-tolerant canonical consensus object, the desired conclusion follows.

4 The Impossibility Result 

The problem we address is to design a system, as given in Section 3, which is an f-fault-tolerant
atomic object (Section 2.4) of type consensus for some (arbitrary) set I of endpoints. We show
that, when the services in the system are restricted to be (f - 1)-fault-tolerant atomic objects,
this problem is impossible to solve. The services can have arbitrary types, and can have as
endpoints any subset of I. Thus, techniques based on quorums, replication, and redundancy could
all be implemented within our model. Our result implies that none of these approaches would help:
a limitation on the fault-tolerance of the underlying services is also a fundamental limitation on
the fault-tolerance of any consensus service that can be built from these underlying services.

Since we now restrict attention to systems that are consensus objects, the inputs a_i and outputs
b_i that represent P_i's interactions with its own clients are now instantiated as the inputs init(0)_i,
init(1)_i, and the outputs decide(0)_i, decide(1)_i, for the single consensus client that P_i now interacts
with.

4.1 Main result and proof assumptions 

The main result of the paper is: 

Theorem 1 Let I be an arbitrary endpoint set such that |I| > 2, and let f be such that 1 < f <
|I|. Then there does not exist a distributed system with services that is an f-fault-tolerant atomic
consensus object for endpoint set I, if the services are (f − 1)-fault-tolerant.

Note that the services can be of any variable type. We assume in the sequel that such a DSS, P,
exists, and derive a contradiction.

We assume that all the processes of P are deterministic automata, as defined in Section 2.1.
Since channels are FIFO, they are already deterministic. We assume a slightly weaker condition
for services, namely that the variable type of each service is deterministic, i.e., the relation δ of the
underlying variable type is a mapping. For an impossibility proof, these assumptions are made
without loss of generality, since processes and services can be made to satisfy the above conditions
by removing a subset of the locally-controlled transitions. Hence, if an unrestricted solution exists,
then a solution satisfying our assumptions also exists.

4.2 Terminology used in the proof 

4.2.1 Transitions 

A transition is a triple (s, a, s'). We define first(s, a, s') = s, action(s, a, s') = a, last(s, a, s') = s'.
The participants of a locally controlled action a of the system (i.e., not an init(v)_i or fail_i action)
are all automata with a in their signature: participants(a) = {A | a ∈ acts(A)}. The participants
of a transition (s, a, s') are the participants of its action: participants(s, a, s') = participants(a).
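
As a small illustration of the participants definition, the following sketch computes the set of components whose signature contains a given action. The component names and the string encoding of actions are hypothetical and are chosen only for this example.

```python
def participants(action, signatures):
    """Return the names of all components that have `action` in their signature."""
    return {name for name, acts in signatures.items() if action in acts}


# Hypothetical signatures for two processes joined by one channel.
signatures = {
    "P1": {"send(m)_1,2", "decide(0)_1"},
    "C12": {"send(m)_1,2", "receive(m)_1,2"},
    "P2": {"receive(m)_1,2", "decide(0)_2"},
}

senders = participants("send(m)_1,2", signatures)
deciders = participants("decide(0)_1", signatures)
```

Here a send action is shared by the sending process and the channel, while a decide action belongs to a single process, which matches the bound of at most two participants used later in the proof.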

If the action a of a transition is an output action of some component A (process or service, since
channels do not have internal actions), then we say that the transition is an output transition of
A. We define internal transition of A similarly. Due to I/O automaton signature compatibility, a
transition can be the output or internal transition of at most one component. Furthermore, due to
the structure of the system, as given in Section 3, every transition, with the exception of transitions
due to the execution of the init(v)_i inputs to P_i, and fail_i actions, is either an output transition or
an internal transition of exactly one component.

4.2.2 Tasks and scheduling 

We say that a task e is applicable to a global state s iff some action of e is enabled in state s. If
α is a finite execution, then we say that e is applicable to α iff e is applicable to last(α). Thus,
if e is an applicable channel task ct_{i,j}, then the corresponding channel C_{i,j} must be nonempty, so
that a message can actually be delivered. If e is an applicable service task st_{k,i}, then either the
invocation buffer of service S_k must contain an invocation from process P_i, or the response buffer
of S_k must contain a response to P_i, or the dummy_{k,i} action must be enabled. We assume, for
technical convenience, that a process always has an enabled locally controlled action, and so a
process task is always applicable.

An applicable task e, together with the current global state, determines a unique transition
(arising from the scheduling of task e in the current state) since processes and channels are
deterministic, and the variable type underlying a service is also deterministic. We denote this
transition as transition(e, s). Let transition(e, s) = (s, a, s'). Then, we apply the notation defined in
Section 4.2.1 to transition(e, s) as follows: first(e, s) = s, action(e, s) = a, last(e, s) = s'. We
abbreviate last(e, s) by e(s). We note that transition(e, s), first(e, s), action(e, s), last(e, s) are
defined if and only if e is applicable to s.

We note that when e is a channel task, then transition(e, s) always causes a change of state,
i.e., e(s) ≠ s, since some message is delivered by the channel. When e is a service task st_{k,i}, then
transition(e, s) causes a change of state unless it corresponds to the execution of a dummy_{k,i} action.
When e is a process task, then transition(e, s) may or may not cause a state change. This would
depend on the transition structure of the process, about which we make no assumptions.

4.2.3 Executions 

Define an initialization of P to be a finite execution containing exactly |I| actions, which moreover
are all init(v_i)_i actions, one for each i ∈ I. Define an execution α of P to be input-first iff it has an
initialization as a prefix, and otherwise contains no init actions. If α is a finite execution, then an
extension of α is an execution α' such that α is a prefix of α'. Define a finite input-first failure-free
execution α to be 0-valent if (1) some input-first failure-free extension of α contains a decide(0)_i
action, for at least one i ∈ I, and (2) no input-first failure-free extension of α contains a decide(1)_i
action, for any i ∈ I. The definition of 1-valent is analogous. Define a finite failure-free execution
α to be univalent iff it is either 0-valent or 1-valent. Define a finite input-first failure-free execution
α to be bivalent iff it has some input-first failure-free extension that contains a decide(0)_i action,
for at least one i ∈ I, and some input-first failure-free extension that contains a decide(1)_i action,
for at least one i ∈ I.
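
The valence definitions can be made concrete on a finite toy graph. The sketch below is an illustration under simplifying assumptions: each node stands for a finite input-first failure-free execution, each edge for a one-step failure-free extension, and `decides` (a hypothetical map) records which values are decided on reaching a node. The real graph G(P) is infinite in general, so this is not a decision procedure for the model in the paper.

```python
def reachable_decisions(graph, decides, node):
    """Collect every value decided at `node` or at any node reachable from it."""
    seen, stack, values = set(), [node], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        values |= decides.get(current, set())
        stack.extend(graph.get(current, []))
    return values


def valence(graph, decides, node):
    """Classify a node as 0-valent, 1-valent, or bivalent."""
    values = reachable_decisions(graph, decides, node)
    if values == {0}:
        return "0-valent"
    if values == {1}:
        return "1-valent"
    if values == {0, 1}:
        return "bivalent"
    return "undetermined"  # no decision is reachable in this finite toy graph


# Hypothetical toy graph: "a" can be extended towards either decision.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["e"], "d": [], "e": []}
decides = {"d": {0}, "e": {1}}
```

In this toy graph, node "a" is bivalent, while "b" and "c" are univalent with opposite valences.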

Since the assumed f-fault-tolerant atomic consensus object P is an I/O automaton, we can
view its transition relation as defining a labeled directed graph whose nodes are the states of P and
which contains a directed edge from s to s' labeled with a iff (s, a, s') is in the transition relation
of P. This graph is called the global state transition graph of P. Let G(P) be the subgraph of
the global state transition graph of P obtained as follows: (1) include every state that lies along
an input-first execution, and (2) include all the transitions of P that connect the states that are
included by virtue of (1).

4.2.4 Schedules 

A schedule is a finite sequence of task names drawn from T ∪ {init(v)_i, fail_i : v ∈ {0, 1}, i ∈ I}.
Let σ = e_1 e_2 ... e_n be a schedule, and s be a global state, such that e_1 is applicable to s, e_2 is
applicable to e_1(s), and, generally, e_i is applicable to e_{i−1}(e_{i−2}(...(e_1(s))...)) for all i, 1 < i ≤ n.
Then, we say that σ is applicable to s, and we let σ(s) denote e_n(e_{n−1}(...(e_1(s))...)). A schedule
σ is applicable to a finite execution α iff σ is applicable to last(α). In this case, we let σ(α) denote
the resulting extension of α.

Let α = s_0 a_1 s_1 a_2 s_2 ... s_{l−1} a_l s_l be a finite execution. Then, we define the schedule
schedule(α) = task(a_1) task(a_2) ... task(a_l). That is, for each action in α, we take the name of
the task containing the action. schedule(α) then consists of these task names in the same order as
their corresponding actions.
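
The two notions of this subsection, applying a schedule to a state and extracting the schedule of an execution, can be sketched as follows. The names `apply_schedule`, `schedule_of`, `step`, and `task_of` are hypothetical, and the toy `step` function (a counter that can only be decremented when positive) merely stands in for the deterministic transition(e, s) of the model.

```python
def apply_schedule(step, schedule, state):
    """Apply the tasks of `schedule` in order, starting from `state`.

    `step(task, state)` returns the unique next state, or None when the
    task is not applicable to the state.
    """
    for task in schedule:
        next_state = step(task, state)
        if next_state is None:
            raise ValueError(f"task {task!r} is not applicable")
        state = next_state
    return state


def schedule_of(execution, task_of):
    """Map an execution [s0, a1, s1, ..., al, sl] to its sequence of task names."""
    return [task_of(action) for action in execution[1::2]]


def step(task, state):
    """Toy deterministic transition function on an integer state."""
    if task == "inc":
        return state + 1
    if task == "dec" and state > 0:
        return state - 1
    return None


task_of = {"a1": "t1", "a2": "t2"}
```

For example, `apply_schedule(step, ["inc", "dec"], 0)` returns to state 0, whereas the schedule `["dec"]` is not applicable to state 0.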

4.3 The proof 

Our proof will build up a series of lemmas establishing certain constraints on G(P). We start with
the basic commutativity situation illustrated in Figure 3.

Lemma 2 Let s be any global state of the f-fault-tolerant atomic consensus object P, and let e_1,
e_2 be tasks such that

1. e_1, e_2 are both applicable to s, and

2. participants(e_1, s) ∩ participants(e_2, s) = ∅.

Let e_1(s) = s_1, and e_2(s) = s_2. Then, e_2 is applicable to s_1, and e_1 is applicable to s_2, and
e_2(s_1) = e_1(s_2).

Figure 3: Commuting tasks w.r.t. a state s.

Proof. By assumption (e_1, s) and (e_2, s) only affect the state of different components. It
follows that e_2 is applicable to s_1, and that e_1 is applicable to s_2. By determinism, it follows that
participants(e_1, s) = participants(e_1, s_2), and that (e_1, s) and (e_1, s_2) are the same transition
"locally," i.e., they effect exactly the same state changes in the components in participants(e_1, s).
Likewise for (e_2, s) and (e_2, s_1). Thus, the accumulated state changes of (e_1, s) followed by (e_2, s_1)
are the same as the accumulated state changes of (e_2, s) followed by (e_1, s_2). Hence the lemma
holds. Figure 3 illustrates the proof. □
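
The commutativity argument can be checked on a toy example in which a global state is a dictionary of component states. The two tasks below are hypothetical and touch disjoint components, which is the situation the lemma assumes; the sketch does not model applicability or the actual automata.

```python
def run(tasks, state):
    """Apply a sequence of tasks (functions on global states) in order."""
    for task in tasks:
        state = task(state)
    return state


def e1(state):
    """Hypothetical task whose only participant is component "A"."""
    return {**state, "A": state["A"] + 1}


def e2(state):
    """Hypothetical task whose only participant is component "B"."""
    return {**state, "B": state["B"] * 2}


s = {"A": 0, "B": 3}
via_s1 = run([e1, e2], s)  # corresponds to e2(e1(s))
via_s2 = run([e2, e1], s)  # corresponds to e1(e2(s))
```

Both orders lead to the same global state, because each task reads and writes only its own component.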

Lemma 3 The f-fault-tolerant atomic consensus object P must have a bivalent initialization.

Proof. Recall that we assume f > 1 (Section 4.1). The argument is then exactly the same as that
in the proof of Lemma 12.3 in [Lyn96, chapter 12]. □

Suppose there exists a finite input-first failure-free execution α_s, and states s, s', s'', s_0, s_1,
and tasks e, e' which are related as given by Figure 4. We call such a configuration a hook, after
[CHT96]. We say that the hook starts in state s, and we call α_s the stem of the hook. We also
admit as a hook a configuration in which the 0-valent and 1-valent states are interchanged.






Lemma 4 Let α_s be a finite input-first failure-free bivalent execution of G(P), and let
first(α_s) = s_start, last(α_s) = s. Let e be a task of P applicable to α_s. Let

U = {α_u | α_u = σ(α_s), σ is a finite failure-free schedule applicable to α_s and not containing e},

V = {e(α_u) | α_u ∈ U and e is applicable to α_u}.

Then either (1) V contains a bivalent execution, or (2) G(P) contains a subgraph which is a hook
starting in s_start, as given by Figure 4.

Figure 4: A hook starting in s. (In the figure, s_0 is 0-valent and s_1 is 1-valent.)

Proof. We assume both the antecedent of the lemma and the negation of (1), and establish (2).

Now e is either a channel task, process task, or service task. If e is a channel task ct_{i,j}, then
applicability of e to s means that channel C_{i,j} contains a message in state s. Thus, e is also
applicable to any state reached from s by a schedule not containing e, since the message remains in
C_{i,j} as long as ct_{i,j} is not scheduled. If e is a process task, then e is applicable to any state, by
our assumption that a process always has some enabled locally controlled action. If e is a service
task st_{k,i}, then applicability of e to s means that either service S_k has a pending invocation from
process P_i in state s, or dummy_{k,i} is enabled. Thus, e is also applicable to any state reached from
s by a schedule not containing e, since the invocation (if present) remains pending as long as st_{k,i}
is not scheduled, and dummy_{k,i} remains enabled once it is enabled. We have therefore shown,

e is applicable to every execution in U. (a)



Since α_s is bivalent, there exists a 0-valent extension α_{x_0} of α_s and a 1-valent extension α_{x_1} of
α_s. For i ∈ {0, 1}, we argue as follows.

CASE 1: α_{x_i} ∈ U. Let α_{v_i} = e(α_{x_i}). Hence α_{v_i} is i-valent, since α_{x_i} is i-valent. Also,
α_{v_i} ∈ V, since α_{x_i} ∈ U.

CASE 2: α_{x_i} ∉ U. Then, e was applied in extending α_s to α_{x_i}. Let α_{v_i} be the unique extension
of α_s whose last action has task e (taking the first application of e along α_{x_i}). α_{v_i} is unique due
to our assumptions in Section 4.1 about the deterministic behavior of processes and variable types.
Hence α_{v_i} = e(α'_s) for some extension α'_s of α_s. Hence α_{v_i} ∈ V by definition of V. Since (1) is
false by assumption, V contains no bivalent executions. Hence α_{v_i} is univalent. But α_{x_i} is i-valent
and is an extension of α_{v_i}. Hence α_{v_i} is i-valent.

Figure 5: Existence of the hook.

Thus, in both cases, we have that α_{v_i} ∈ V and α_{v_i} is i-valent. Moreover, this holds for both
i = 0 and i = 1. Thus

there exist 0-valent α_{v_0} ∈ V and 1-valent α_{v_1} ∈ V. (b)

Let α_v = e(α_s), and let v = last(α_v). Hence α_v ∈ V, and so α_v is univalent by the assumption that
(1) is false. Without loss of generality, let α_v be 0-valent. By (b), there exists α_{v_1} ∈ V which is
1-valent. Let α_{u_m} be an execution in U such that e(α_{u_m}) = α_{v_1}, and let u_m = last(α_{u_m}). Hence, we
have the situation depicted in Figure 5, since α_{u_m} is an extension of α_s. (The state s is the same
state in Figures 4 and 5.) Consider the (unique) execution fragment γ such that α_{u_m} = α_s · γ. By
(a), e is applicable to every state along γ. Since the resulting executions are all in V by definition,
they are all univalent, by assumption. Since α_v is 0-valent and α_{v_1} is 1-valent, it follows that there
exist two such executions, α_0 and α_1, such that α_0 is 0-valent, α_1 is 1-valent, and α_0, α_1 result from
applying e to adjacent states along γ. The subgraph of G(P) generated by taking the "union" of α_0
and α_1 (i.e., take all states and transitions occurring in one, or both, of α_0, α_1) is then the desired
hook. □
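
The final step of the proof is a discrete intermediate-value argument: along a path whose first state yields one valence under e and whose last state yields the other, two adjacent states must disagree. A minimal sketch follows, assuming the valences obtained by applying e at successive states along the path are given as a list of 0s and 1s; the function name `find_flip` is hypothetical.

```python
def find_flip(valences):
    """Return the first index j such that valences[j] != valences[j + 1].

    `valences[j]` is the valence (0 or 1) of the execution obtained by
    applying e at the j-th state along the path. Returns None when no
    two adjacent entries differ.
    """
    for j in range(len(valences) - 1):
        if valences[j] != valences[j + 1]:
            return j
    return None


# Hypothetical path: applying e early gives 0-valent executions, later 1-valent ones.
flip_index = find_flip([0, 0, 0, 1, 1])
```

Whenever the first and last entries differ, such an index exists, and the two adjacent states at that index play the roles of the hook's branching states.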

Lemma 5 G(P) does not contain as a subgraph a hook whose stem is a finite input-first failure-free
execution.

Proof. Our proof is by contradiction. We assume that G(P) does contain such a hook, and establish
that P is not an f-fault-tolerant atomic consensus object, contrary to assumption.

Without loss of generality, we assume the configuration in Figure 4. For each state except s_start,
we let α subscripted with the state name denote the unique finite execution which is contained in
the hook and which ends in that state: α_{s'} is the stem of the hook, α_{s_0} ends in s_0, α_{s_1} ends in s_1,
and α_{s''} ends in s''.

We remark that α_{s'} cannot contain any decide actions, since it is bivalent, and this would
otherwise violate the agreement property. We first establish Claims 1-3.

Claim 1: e ≠ e'.
Suppose not. Then, by determinism (Section 4.1), we have s_0 = s''. Now s_1 is reachable from s'',
and s_1 is 1-valent. Hence, s'' is either bivalent or 1-valent. s_0, however, is 0-valent. Hence we have
a contradiction. So, Claim 1 is established.

Claim 2: |participants(e, s')| ≤ 2, |participants(e', s')| ≤ 2.
From the structure of a DSS (Section 3), we see that every output action of some component is an
input action of at most one other component. The claim follows.

Claim 3: |participants(e, s') ∩ participants(e', s')| ≤ 1.
From Claim 2, we immediately have that |participants(e, s') ∩ participants(e', s')| ≤ 2. Suppose
|participants(e, s') ∩ participants(e', s')| = 2. From Claim 1, we know that e ≠ e'. Hence, it must
be that, for some distinct components C_1, C_2, action(e, s') is an output action of C_1 and an input
action of C_2, and action(e', s') is an input action of C_1 and an output action of C_2. Since services and
channels have no actions in common, the only possibilities for this are:

• {C_1, C_2} = {P_i, S_k} for some P_i, S_k.
This violates well-formedness of P_i for S_k.

• {C_1, C_2} = {P_i, C_{i,j}} for some P_i, C_{i,j}.
No output action of C_{i,j} is an input action of P_i.

• {C_1, C_2} = {P_i, C_{j,i}} for some P_i, C_{j,i}.
No output action of P_i is an input action of C_{j,i}.

Since all three cases lead to a contradiction, the claim is established.

From Claim 3, we have four possibilities for participants(e, s') ∩ participants(e', s'). To complete
the proof of the lemma, we consider each separately.

CASE 1: participants(e, s') ∩ participants(e', s') = ∅. Hence, the antecedent of Lemma 2 holds
for s = s', e_1 = e, and e_2 = e'. Hence, e' is applicable to s_0, and e'(s_0) = s_1. Hence, e'(α_{s_0}) and
α_{s_1} have at least one infinite fair extension with a common suffix. Since α_{s'} does not contain any
decide actions, it follows that the suffix must contain decide actions. Now α_{s_0} is 0-valent and α_{s_1}
is 1-valent. Hence, no matter what decide actions this common suffix contains, it will violate the
valencies of at least one of α_{s_0}, α_{s_1}.

CASE 2: participants(e, s') ∩ participants(e', s') = S_k.

Subcase 2.1: At least one of action(e, s'), action(e', s') is not a perform action of S_k. Hence
at least one of these is an invocation or a response. Now invocation and response actions do not
change the value of the underlying variable of S_k.

Since both these actions are enabled in s', it follows that the enablement of neither action
depends on the prior execution of the other action (this might be the case for certain (invocation,
perform) or (perform, response) pairs of actions, but not here). Hence, from Figure 1, we see that
these actions commute, in that their order can be reversed and the same final global state will
result. Hence, e' is applicable to s_0, and e'(s_0) = s_1. Hence, e'(α_{s_0}) and α_{s_1} have at least one
infinite fair extension with a common suffix. Since α_{s'} does not contain any decide actions, it
follows that the suffix must contain decide actions. Now α_{s_0} is 0-valent and α_{s_1} is 1-valent. Hence,
no matter what decide actions this common suffix contains, it will violate the valencies of at least
one of α_{s_0}, α_{s_1}.

Subcase 2.2: Both of action(e, s'), action(e', s') are perform actions of S_k. Since α_{s'} is
bivalent, then, under the assumption that P solves f-fault-tolerant consensus, α_{s'} cannot contain
any decide actions, since that would violate agreement. Hence, α_{s_0} does not contain any decide
actions either, since action(e, s') is not a decide.

Let α'' be an infinite fair execution that extends α_{s_0}, and let α' be the suffix of α'' starting in
state s_0. Furthermore, let α' be chosen such that:

1. The first f actions along α' are fail_j actions for f different j ∈ J_k.

2. For every occurrence of an action a along α', and every i ∈ I, if task(a) = st_{k,i}, then a =
dummy_{k,i}. That is, whenever st_{k,i} is scheduled along α', the dummy_{k,i} action is chosen. Since
dummy_{k,i} is enabled at all states of α' except the first, it is certainly possible to always choose
to schedule the dummy_{k,i} action in this way, along α'.

Since P is f-tolerant, f > 1, a decide(v)_j action, for every nonfaulty process, must occur along
α'. Let σ be the schedule (Section 4.2.4) of the prefix of α' that ends in the state just after the first
such decide(v)_j action. From σ, derive the schedule σ' by removing:

1. Every occurrence of a fail_j, and

2. Every occurrence of st_{k,i} for all i ∈ I (these all correspond to dummy_{k,i} actions in that prefix).

It is clear that σ' is a failure-free schedule. Since, in σ, the transitions corresponding to the above
task occurrences do not induce any change of state other than to S_k, which is silent, it follows that
σ' is applicable to s_0, and that σ'(α_{s_0}) contains a single decide action.
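
Deriving the failure-free schedule from the original one is a simple filtering step. The sketch below assumes that task names are encoded as strings; the encoding, the example schedule, and the helper name `strip_tasks` are hypothetical and serve only to illustrate the construction.

```python
def strip_tasks(schedule, removed):
    """Return `schedule` with every occurrence of a task in `removed` deleted."""
    return [task for task in schedule if task not in removed]


# Hypothetical schedule containing one failure and two dummy steps of service k.
sigma = ["fail_1", "pt_2", "st_k,1", "ct_2,3", "st_k,2", "pt_3"]
sigma_prime = strip_tasks(sigma, {"fail_1", "st_k,1", "st_k,2"})
```

The relative order of the remaining tasks is preserved, which is what allows the derived schedule to reproduce the same steps of the other components.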

By the case condition, s_0 and s_1 differ only in the state of S_k. Since processes and channels are
deterministic, and since services have a deterministic type and also behave as given by Figure 1,
we can see that σ' is applicable to s_1, and that σ'(α_{s_1}) is the same as σ'(α_{s_0}), with the exception of
the local state of S_k. In particular, σ'(α_{s_1}) and σ'(α_{s_0}) contain the same action subsequence. So,
σ'(α_{s_1}) and σ'(α_{s_0}) contain the same single decide(v)_i action, for some v ∈ {0, 1}. Choosing v = 0
contradicts the 1-valency of s_1, and choosing v = 1 contradicts the 0-valency of s_0.

CASE 3: participants(e, s') ∩ participants(e', s') = C_{i,j}. Since P_i and C_{i,j} are deterministic,
and e ≠ e', it follows that one of action(e, s'), action(e', s') is a send(m)_{i,j}, and the other is
a receive(m')_{i,j}, for some m, m' ∈ M. Since these are both enabled in s', it follows from the
definition of a FIFO channel (see [Lyn96, chapter 14]) that transition(e, s') and transition(e', s')
commute. The remainder of the argument is similar to Subcase 2.1.

CASE 4: participants(e, s') ∩ participants(e', s') = P_i.
Since α_{s'} is bivalent, then, under the assumption that P solves f-fault-tolerant consensus, α_{s'}
cannot contain any decide() actions, since that would violate agreement.

Let α'' be an infinite fair execution that extends α_{s'}, and let α' be the suffix of α'' starting in
state s'. Furthermore, let α' be chosen such that:

1. The action along α' that starts in s' is fail_i, and

2. No fail_j actions, j ≠ i, occur along α', and

3. For every action a, and every occurrence of a along α', if task(a) = st_{k,i} for some k ∈ K, then
a = dummy_{k,i}. That is, whenever st_{k,i} is scheduled along α', the dummy_{k,i} action is chosen.
Since dummy_{k,i} is enabled at all states, except the first, of any execution fragment that starts
with fail_i, it is certainly possible to always choose to schedule the dummy_{k,i} action in this
way, along α'.

Since P is f-tolerant, f > 1, a decide(v)_j action, for every j ≠ i, must occur along α'. Let σ be
the schedule (Section 4.2.4) of the prefix of α' that ends in the state just after the first such
decide(v)_j action. From σ, derive the schedule σ' by removing:

1. The single occurrence of fail_i, and

2. Every occurrence of st_{k,i} for all k ∈ K (these all correspond to dummy_{k,i} actions in that
prefix), and

3. Every occurrence of ct_{j,i}, for all j ∈ I, j ≠ i.

Since the only fail action along σ is fail_i, it is clear that σ' is a failure-free schedule. Since, in
σ, the transitions corresponding to the above task occurrences do not induce any change of state
other than to P_i, which has failed, it follows that σ' is applicable to s', and that σ'(α_{s'}) contains a
single decide action. We now establish Claims 4.1 and 4.2.

Claim 4.1:

1. σ' is applicable to s_0.

2. Let γ be the suffix of σ'(α_{s'}) starting in s', and let γ_0 be the suffix of σ'(α_{s_0}) starting in s_0.
Then γ, γ_0 contain the same decide actions.

We establish the claim by case analysis on the possibilities for action(e, s'). From the Case 4
condition, we have that P_i ∈ participants(e, s'). This restricts the possibilities for action(e, s') to
the following.



Subcase 4.1.1: action(e, s') = a_{k,i}, a ∈ T_k.invs. By definition, σ' contains no occurrence of
st_{k,i}. Hence, γ contains no action in st_{k,i}. Let γ_00 be the same as γ except that, for corresponding
states along γ_00, the invocation buffer of S_k contains additionally the invocation (i, a). Since γ_00
contains no action in st_{k,i}, this extra invocation is never processed (by a perform() action) along
γ_00. Hence, the state-action-state triples along γ_00 are actual transitions of P (i.e., elements of
steps(P)). Thus, γ_00 is an actual execution fragment of G(P). Furthermore, the first state of γ_00
is s_0, and schedule(γ_00) = σ'. Hence σ' is applicable to s_0. Now γ_00 is the suffix of σ'(α_{s_0}) starting
in s_0. Also, γ and γ_00 contain the same subsequence of actions, and so in particular contain the
same decide actions. Letting γ_0 = γ_00 establishes the claim in this case.



Subcase 4.1.2: action(e, s') = b_{k,i}, b ∈ T_k.resps. By definition, σ' contains no occurrence of
pt_i nor of st_{k,i}. Let γ be the suffix of σ'(α_{s'}) starting in s'. Hence, γ contains no action in pt_i nor in
st_{k,i}. Let γ_00 be the same as γ except that, for corresponding states along γ_00, the response buffer
of S_k is missing the response (i, b), and the state of P_i is the result of executing input action b_{k,i} in
state s'.

We now argue that every state-action-state triple along γ_00 is in steps(P), i.e., is an actual
transition of P. Since γ_00 contains no actions in pt_i, this difference in P_i's local state does not cause
any state-action-state triple along γ_00 to not be a transition of P, since no action along γ_00 either
depends on (for enablement) or changes P_i's local state. Likewise, since γ_00 contains no actions
in st_{k,i}, the difference in the response buffer of S_k cannot cause any state-action-state triple
along γ_00 to not be a transition of P, since no action along γ_00 depends on (for enablement)
those elements of S_k's response buffer of the form (i, b), nor does any such action add or remove
elements of the form (i, b) to S_k's response buffer. Thus, γ_00 is an actual execution fragment of
G(P). Furthermore, the first state of γ_00 is s_0, and schedule(γ_00) = σ'. Hence σ' is applicable to
s_0. Now γ_00 is the suffix of σ'(α_{s_0}) starting in s_0. Also, γ and γ_00 contain the same subsequence
of actions, and so in particular contain the same decide actions. Letting γ_0 = γ_00 establishes the
claim in this case.



Subcase 4.1.3: action(e, s') = send(m)_{i,j}, m ∈ M. By definition, σ' contains no occurrence
of pt_i. Let γ be the suffix of σ'(α_{s'}) starting in s'. Hence, γ contains no action in pt_i. Also, message
m is not received by P_j along γ, since it was not sent. (Wlog, we assume that all messages are
tagged with unique identifiers. This is for the purpose of the proof only, and is not a restriction
on the assumed system P.) Let γ_00 be the same as γ except that, for corresponding states along
γ_00, C_{i,j} contains in addition message m at its end (i.e., m is the "last" message in C_{i,j}; recall that
channels are FIFO), and the state of P_i is the result of executing output action send(m)_{i,j} in state
s'.

We now argue that every state-action-state triple along γ_00 is in steps(P), i.e., is an actual
transition of P. Since γ_00 contains no actions in pt_i, this difference in P_i's local state does not cause
any state-action-state triple along γ_00 to not be a transition of P, since no action along γ_00 either
depends on (for enablement) or changes P_i's local state. Likewise, the difference in the contents of
C_{i,j} cannot cause any state-action-state triple along γ_00 to not be a transition of P. The only triples
that could possibly be affected are those whose action is receive(m')_{i,j} for some m' ∈ M. But all
such triples will correspond to the reception of the message m' actually at the head of C_{i,j} (in the
initial global state of the triple), since the only difference in the contents of C_{i,j} is that an extra
message has been appended at the rear of C_{i,j}. In other words, C_{i,j} delivers the same sequence of
messages along γ_00 that it does along γ. Hence, all these triples will be actual transitions of P.
Thus, γ_00 is an actual execution fragment of G(P). Furthermore, the first state of γ_00 is s_0, and
schedule(γ_00) = σ'. Hence σ' is applicable to s_0. Now γ_00 is the suffix of σ'(α_{s_0}) starting in s_0. Also,
γ and γ_00 contain the same subsequence of actions, and so in particular contain the same decide
actions. Letting γ_0 = γ_00 establishes the claim in this case.
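
The key observation about the FIFO channel in this subcase, that an extra message appended at the rear does not change what is delivered from the head, can be checked with a small sketch. The helper `deliveries` and the message names are hypothetical; the channel is modeled simply as a double-ended queue.

```python
from collections import deque


def deliveries(contents, n):
    """Deliver n messages from the head of a FIFO channel holding `contents`."""
    channel = deque(contents)
    return [channel.popleft() for _ in range(n)]


base = ["m1", "m2", "m3"]
extended = base + ["m"]  # the same channel with one extra message at the rear

delivered_base = deliveries(base, 3)
delivered_extended = deliveries(extended, 3)
```

As long as no more messages are delivered than the original channel held, both channels deliver the same sequence, which mirrors the claim that the two execution fragments contain the same receive actions.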



Subcase 4.1.4: action(e, s') = receive(m)_{j,i}, m ∈ M. By definition, σ' contains no occurrence
of pt_i nor of ct_{j,i}. Let γ be the suffix of σ'(α_{s'}) starting in s'. Hence, γ contains no action in pt_i
nor in ct_{j,i}. Let γ_00 be the same as γ except that, for corresponding states along γ_00, C_{j,i} is missing
the message m at its head, and the state of P_i is the result of executing input action receive(m)_{j,i}
in state s'.

We now argue that every state-action-state triple along γ_00 is in steps(P), i.e., is an actual
transition of P. Since γ_00 contains no actions in pt_i, this difference in P_i's local state does not cause
any state-action-state triple along γ_00 to not be a transition of P, since no action along γ_00 either
depends on (for enablement) or changes P_i's local state. Likewise, the difference in the contents
of C_{j,i} cannot cause any state-action-state triple along γ_00 to not be a transition of P, since γ_00
contains no action in ct_{j,i}. Thus, γ_00 is an actual execution fragment of G(P). Furthermore, the
first state of γ_00 is s_0, and schedule(γ_00) = σ'. Hence σ' is applicable to s_0. Now γ_00 is the suffix of
σ'(α_{s_0}) starting in s_0. Also, γ and γ_00 contain the same subsequence of actions, and so in particular
contain the same decide actions. Letting γ_0 = γ_00 establishes the claim in this case.



Subcase 4.1.5: action(e, s') = decide(v)_i or action(e, s') is an internal action of P_i. By
definition, σ' contains no occurrence of pt_i. Let γ be the suffix of σ'(α_{s'}) starting in s'. Hence, γ
contains no action in pt_i. Let γ_00 be the same as γ except that, for corresponding states along γ_00,
the state of P_i is the result of executing action(e, s').

We now argue that every state-action-state triple along γ_00 is in steps(P), i.e., is an actual
transition of P. Since γ_00 contains no actions in pt_i, this difference in P_i's local state does not cause
any state-action-state triple along γ_00 to not be a transition of P, since no action along γ_00 either
depends on (for enablement) or changes P_i's local state. Thus, γ_00 is an actual execution fragment
of G(P). Furthermore, the first state of γ_00 is s_0, and schedule(γ_00) = σ'. Hence σ' is applicable
to s_0. Now γ_00 is the suffix of σ'(α_{s_0}) starting in s_0. Also, γ and γ_00 contain the same subsequence
of actions, and so in particular contain the same decide actions. Letting γ_0 = γ_00 establishes the
claim in this case.

From our definition of distributed system with services, we see that the above are all the possible
cases for action(e, s'). Having established Claim 4.1 in each case, we conclude that it holds generally.
(end proof of Claim 4.1)

Claim 4.2:

1. σ' is applicable to s_1.

2. Let γ be the suffix of σ'(α_{s'}) starting in s', and let γ_1 be the suffix of σ'(α_{s_1}) starting in s_1.
Then γ, γ_1 contain the same decide actions.

From the Case 4 condition, we have that P_i ∈ participants(e', s'). Hence, we can apply exactly the
same argument as used in the proof of Claim 4.1 to conclude that:

1. σ' is applicable to s''.

2. Let γ'' be the suffix of σ'(α_{s''}) starting in s''. Then γ, γ'' contain the same decide actions.

From the Case 4 condition, we have that P_i ∈ participants(e, s'). Hence, e = pt_i, or e = ct_{j,i},
or e = st_{k,i}, with action(e, s') = b_{k,i} for some b ∈ T_k.resps. If e = pt_i or e = ct_{j,i}, then clearly
P_i ∈ participants(e, s''). If e = st_{k,i}, with action(e, s') = b_{k,i} for some b ∈ T_k.resps, then, by
well-formedness of P_i w.r.t. S_k, and P_i ∈ participants(e', s'), it follows that action(e', s') ≠ a_{k,i} for all
a ∈ T_k.invs. From e ≠ e' it follows that action(e', s') ≠ b_{k,i} for all b ∈ T_k.resps, since otherwise we
would have e' = e = st_{k,i}. Hence, from P_i ∈ participants(e', s'), we conclude S_k ∉ participants(e', s').
Hence, the local state of S_k is the same in s' and s'', i.e., s'|S_k = s''|S_k. Since action(e, s') = b_{k,i},
we know that in state s', (i, b) is in the response buffer of S_k. Hence, we conclude that in state
s'', (i, b) is in the response buffer of S_k. Thus, by well-formedness of P_i w.r.t. S_k, in state s'', the
invocation buffer of S_k contains no invocation (i, a), for any a ∈ T_k.invs. Now s'' lies along a
fault-free execution. Hence, dummy_{k,i} is not enabled in s''. Hence, in state s'', the only action of task
st_{k,i} that is enabled is b_{k,i} (see Figure 1). Hence action(e, s'') = b_{k,i}. Hence P_i ∈ participants(e, s'').

Thus, for all possible cases of e, we have established P_i ∈ participants(e, s''). Hence, from (1)
σ' is applicable to s'', and (2) γ, γ'' contain the same decide actions, which we showed above, we
can apply exactly the same argument as used in the proof of Claim 4.1 to establish Claim 4.2.
(end proof of Claim 4.2)

Since σ' is a failure-free schedule, and α_{s_0} is a finite failure-free execution, we conclude that
σ'(α_{s_0}) is a finite failure-free execution. Since s_0 is 0-valent, it follows that σ'(α_{s_0}) contains at least
one decide(0)_j action, for some j ∈ I.

Since σ' is a failure-free schedule, and α_{s_1} is a finite failure-free execution, we conclude that
σ'(α_{s_1}) is a finite failure-free execution. Since s_1 is 1-valent, it follows that σ'(α_{s_1}) contains at least
one decide(1)_{j'} action, for some j' ∈ I.

Let γ be the suffix of σ'(α_{s'}), γ_0 be the suffix of σ'(α_{s_0}), and γ_1 be the suffix of σ'(α_{s_1}).
From Claims 4.1 and 4.2, we have that γ, γ_0, and γ_1 all contain the same decide actions. By its
construction, γ contains a single decide action. Hence, γ_0, γ_1 contain a single decide(v)_i action in
common, for some v ∈ {0, 1}, i ∈ I. Choosing v = 0 contradicts the 1-valency of s_1, and choosing
v = 1 contradicts the 0-valency of s_0. Hence, we have derived the desired contradiction.
(end of CASE 4) 

Since we have established a contradiction in all of CASES 1-4, the lemma holds. □ 

Lemma 6 Let α_s be a finite input-first failure-free bivalent execution of G(P), and let last(α_s) = s.
Let e be a task of P applicable to α_s. Let

U = {α_u | α_u = σ(α_s), σ is a finite failure-free schedule applicable to α_s and not containing e},

V = {e(α_u) | α_u ∈ U and e is applicable to α_u}.
Then V contains a bivalent execution.

Proof. In the statement of Lemma 4, α_s is a finite failure-free execution and σ is a finite failure-free
schedule. Hence, condition (2) of Lemma 4 is the existence of a hook in G(P) whose stem is a finite
input-first failure-free execution. By Lemma 5, we know that (2) cannot hold. Thus, the desired
result follows immediately from Lemma 4. □

We now present the proof of Theorem 1: 

Assume that P is such a distributed system with services. Using Lemma 6, we construct an
infinite execution γ of P in which no decide action occurs. By Lemma 3, P must have a bivalent
initialization. Call it γ_0. We now apply Lemma 6 to extend γ_0 repeatedly.

Fix an arbitrary round-robin order of all the tasks in P, except for the init(v)_i and fail_i tasks.
Let γ_i be the current execution, and let t_j be the next task in the round-robin order. Assume
inductively that γ_i is bivalent (γ_0 gives the base case).

If t_j is not applicable to last(γ_i), then move on to the next task in the round-robin order, and so on,
until an applicable task is found. Since the process tasks are always applicable, we are guaranteed
to find an applicable task. So, without loss of generality, let t_j be this task.
By Lemma 6, there is a bivalent extension γ_{i+1} of γ_i such that the last action along γ_{i+1} is in
task t_j (taking e = t_j in Lemma 6).

Let γ be the unique execution such that for all i ≥ 0, γ_i is a prefix of γ. If a task t is continuously
enabled, then, when it is selected in the round-robin order, it will be found applicable to the last
state of the current execution. Hence, the extension will contain an action from t. Along γ, this
will happen infinitely often. Hence, γ satisfies the I/O automaton weak fairness condition. Since γ
has infinitely many prefixes γ_i, i ≥ 0, that are executions of P, it thus follows that γ is an execution
of P. Since none of the γ_i contain a decide action, it follows that γ does not either. □
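
As an illustration only, the round-robin construction in this proof can be sketched in Python. The sketch is a toy under stated assumptions: an execution is modeled as the tuple of task names scheduled so far, and `applicable` and `extend` are hypothetical callbacks that are not part of the model above. In particular, `extend` stands in for the bivalent extension whose existence Lemma 6 guarantees; the lemma is not constructive, and the sketch does not check valence.

```python
def build_prefixes(tasks, applicable, extend, gamma0, steps):
    """Return [gamma_0, ..., gamma_steps], visiting tasks in round-robin order.

    tasks      : fixed round-robin order of task names
    applicable : hypothetical predicate applicable(task, execution)
    extend     : hypothetical stand-in for Lemma 6; must return an extension
                 of the execution whose last action belongs to the given task
    """
    prefixes = [gamma0]
    pos = 0  # index of the next task in the round-robin order
    for _ in range(steps):
        current = prefixes[-1]
        # Skip tasks that are not applicable to the current execution.
        for offset in range(len(tasks)):
            task = tasks[(pos + offset) % len(tasks)]
            if applicable(task, current):
                break
        else:
            raise RuntimeError("no applicable task")
        pos = (pos + offset + 1) % len(tasks)
        nxt = extend(current, task)
        # Each gamma_{i+1} must extend gamma_i and end with the chosen task.
        assert nxt[:len(current)] == current and nxt[-1] == task
        prefixes.append(nxt)
    return prefixes


# Toy instance: process tasks are always applicable, the service task
# "st_1" only when the execution has even length (an arbitrary choice).
TASKS = ["pt_1", "pt_2", "st_1"]
toy_applicable = lambda task, ex: task != "st_1" or len(ex) % 2 == 0
toy_extend = lambda ex, task: ex + (task,)
prefixes = build_prefixes(TASKS, toy_applicable, toy_extend, (), 6)
```

Each element of `prefixes` is a prefix of the next, mirroring how γ is obtained as the limit of the γ_i.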



5 k-set consensus 

We now show that when the system is solving a problem that is weaker than consensus, namely
k-consensus (Section 2.2), it is possible to boost the fault-tolerance level. Assume we have available
f-fault-tolerant k-consensus services, each one with m ports. An f'-fault-tolerant algorithm that
solves k'-consensus is as follows. Take a principal subset of the processes, and divide it into s
disjoint groups, each one accessing a different service. Each principal process participates in an
execution, proposing its input value to its designated service. If and when it gets a decision back, it
sends the decision to all the other processes in the entire set of processes (not just those involved
in the same consensus service). Meanwhile, each principal process collects all the results it receives
from all processes, and decides on any of these results. The remaining processes simply wait for
a result from one of the principal processes. The values of k' and f' depend on the size of the
principal set, and on the number s of services we divide it into. There is a tradeoff between k' and
f': if a small number of failures f' is tolerated, then a high degree of agreement is achieved, namely
a small k'. If more failures f' must be tolerated, then a lower degree of agreement is achieved,
namely a large k'.
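
The algorithm just described can be exercised with a small simulation. The following Python sketch is not the formal model of this paper; it rests on simplifying assumptions that are labeled in the code: faulty processes crash before taking any step, a service with more than f crashed ports never responds (the worst case), and every message sent is delivered. The names `run_boosting` and `check_all_crash_sets` are hypothetical.

```python
import random
from itertools import combinations


def run_boosting(inputs, groups, k, f, crashed, rng):
    """Simulate one run; return {process: decision} for processes that decide.

    inputs  : dict mapping each process to its input value
    groups  : disjoint lists of principal processes, one list per service
    crashed : processes that crash before taking any step (simplification)
    """
    relayed = []  # decisions that live principals send to all processes
    for group in groups:
        live = [p for p in group if p not in crashed]
        if not live or len(group) - len(live) > f:
            continue  # more than f failures: assume the service never responds
        proposals = sorted({inputs[p] for p in live})
        # A k-consensus service returns at most k distinct proposed values.
        allowed = rng.sample(proposals, min(k, len(proposals)))
        relayed.extend(rng.choice(allowed) for _ in live)
    # Every live process (principal or not) decides on any value it receives.
    return {p: rng.choice(relayed) for p in inputs
            if p not in crashed and relayed}


def check_all_crash_sets(n=5, f_prime=3, seed=0):
    """Try every crash set of size f' with two 1-fault-tolerant consensus services."""
    inputs = {p: 10 + p for p in range(n)}
    groups = [[0, 1], [2, 3]]  # f' + 1 = 4 principals, two per service
    rng = random.Random(seed)
    for crashed in map(set, combinations(inputs, f_prime)):
        decisions = run_boosting(inputs, groups, k=1, f=1, crashed=crashed, rng=rng)
        live = set(inputs) - crashed
        ok = (set(decisions) == live                               # termination
              and set(decisions.values()) <= set(inputs.values())  # validity
              and len(set(decisions.values())) <= 2)               # at most k' = 2 values
        if not ok:
            return False
    return True
```

With five processes and f' = 3, every crash set leaves at least one of the two services with at most one crashed port, so `check_all_crash_sets()` is expected to return `True`.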

To prove correctness, we divide the principal processes appropriately among the services they
access. We must ensure that fewer than s·(f + 1) principal processes can fail, i.e., f' < s·(f + 1), to
guarantee that at least one service S has at most f failures. Service S is therefore not killed, and
moreover, S has at least one nonfaulty participant, who succeeds in sending the value to everyone.
That means that every nonfaulty process decides. The value of k', i.e., the number of possible
different decision values, is at most s·k: there are at most k different values returned per service;
more precisely, at most k values per service being accessed by at least k processes, and c values for a
service that is being accessed by c processes for c < k. Thus, for a desired overall fault-tolerance f',
we want the smallest possible k' and so we find the smallest integer s that guarantees f' < s·(f + 1).
Thus we use s = ⌈(f' + 1)/(f + 1)⌉ services, and take the first f' + 1 processes to be the principal
processes (f' + 1 processes using as few services as possible, each one with f + 1 input ports). It
follows that

Theorem 7 For any 1 ≤ k ≤ m, k ≤ f ≤ m − 1, 1 ≤ f' ≤ n − 1, it is possible to solve f'-tolerant
k'-consensus for an endpoint set of n processes using f-tolerant k-consensus services, each one with
m ports, for

$$k' = k \left\lfloor \frac{f'+1}{f+1} \right\rfloor + \min\bigl(k,\ (f'+1) \bmod (f+1)\bigr).$$
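
The quantities s and k' are plain integer arithmetic. A minimal Python sketch follows; the helper names `services_needed` and `agreement_bound` are hypothetical and chosen only for illustration.

```python
def services_needed(f_prime, f):
    """Smallest integer s with f' < s * (f + 1), i.e. ceil((f' + 1) / (f + 1))."""
    return -(-(f_prime + 1) // (f + 1))  # ceiling division on integers


def agreement_bound(k, f, f_prime):
    """k' = k * floor((f' + 1) / (f + 1)) + min(k, (f' + 1) mod (f + 1))."""
    full_groups, leftover = divmod(f_prime + 1, f + 1)
    # Each full group of f + 1 principals contributes at most k values;
    # a last, smaller group of `leftover` principals contributes min(k, leftover).
    return k * full_groups + min(k, leftover)
```

For instance, with k = 2, f = 3 and f' = 5, the six principal processes are split into one group of four and one group of two, so `services_needed(5, 3)` is 2 and `agreement_bound(2, 3, 5)` is 4.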



When each service is completely reliable, that is, f = m − 1, and we divide the processes as
described above, this algorithm reduces to the one of [HR00], and gives an upper bound proved to
be tight using topology. As an example, suppose we want to build an f'-fault-tolerant algorithm, with
f' = 2c − 1, for an endpoint set containing at least 2c processes, using only 1-fault-tolerant consensus
services, i.e., f = 1, k = 1. The smallest k' for which we can do this is k' = c, using s = c services, each
with 2 processes (f' + 1 = 2c principal processes).
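
The arithmetic of this example can be checked against the bound of Theorem 7. The following is only a worked instance, with f = 1, k = 1, and f' = 2c − 1 substituted into the expressions for s and k':

```latex
\begin{align*}
s  &= \left\lceil \frac{f'+1}{f+1} \right\rceil
    = \left\lceil \frac{2c}{2} \right\rceil = c, \\
k' &= k \left\lfloor \frac{f'+1}{f+1} \right\rfloor
      + \min\bigl(k,\ (f'+1) \bmod (f+1)\bigr)
    = 1 \cdot c + \min(1,\ 2c \bmod 2) = c + 0 = c.
\end{align*}
```

For c = 3, this gives f' = 5, s = 3 services with two processes each, and k' = 3.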



6 Further Work and Conclusions 

We studied the consensus problem in an asynchronous distributed system with stopping failures,
in which processes can access services that abstract oracles such as hardware primitives or failure
detectors. Many papers have studied a similar model, but to our knowledge this is the first time
that services implemented by the processes in the system are considered. We showed that
f-tolerant consensus is not achievable using less fault-tolerant consensus services as building blocks,
but that k-consensus can be solved with less fault-tolerant k'-consensus services as building blocks.

Our algorithm for k-consensus generalizes that of [HR94, HR00] for reliable services. That
algorithm achieves a tight upper bound. The exact situation for k-set consensus in our model
remains an open question: for which k, k', f, f' is it possible to construct a k-consensus service
tolerating f failures from k'-consensus services tolerating f' failures each? This seems to lead to
more general hierarchy results, in the style of Herlihy's universality result [Her91], the consensus
wait-free hierarchy [Jay97], and the set-consensus hierarchy, e.g., [BG93], all of these for services
that can fail in our sense.



A Technical Background

Definition 1 (I/O Automaton) An I/O automaton A consists of five components:

1. A set of states states(A).

2. A nonempty set start(A) ⊆ states(A) of start states.

3. A signature sig(A) = (in(A), out(A), int(A)) where in(A), out(A), and int(A) are disjoint
sets of input, output, and internal actions, respectively. Denote by local(A) the set out(A) ∪
int(A) and by acts(A) the set in(A) ∪ out(A) ∪ int(A).

4. A task partition tasks(A), which is a partition of local(A) into at most a countable number
of classes.

5. A transition relation steps(A) ⊆ states(A) × acts(A) × states(A).

Let s, s', u, u', . . . range over states and a, b, . . . range over actions. We say that a is enabled in
state s iff there exists a state s' such that (s, a, s') ∈ steps(A). If t is a task and some action a ∈ t is
enabled in state s, then we say that task t is enabled in state s.
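
Definition 1 and the notion of enabledness can be made concrete for finite automata. The Python sketch below is an illustration under assumptions: it handles only finite state and action sets, and the class and method names (`IOAutomaton`, `is_well_formed`, `enabled`, `task_enabled`) as well as the toy automaton are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Hashable, Tuple

State = Hashable
Action = Hashable


@dataclass(frozen=True)
class IOAutomaton:
    states: FrozenSet[State]
    start: FrozenSet[State]
    inputs: FrozenSet[Action]      # in(A)
    outputs: FrozenSet[Action]     # out(A)
    internals: FrozenSet[Action]   # int(A)
    tasks: Tuple[FrozenSet[Action], ...]
    steps: FrozenSet[Tuple[State, Action, State]]

    def local(self):
        return self.outputs | self.internals

    def acts(self):
        return self.inputs | self.local()

    def is_well_formed(self):
        """Check the five components of Definition 1 (finite case only)."""
        sig = (self.inputs, self.outputs, self.internals)
        sig_disjoint = sum(len(x) for x in sig) == len(self.acts())
        tasks_cover = frozenset().union(*self.tasks) == self.local()
        tasks_disjoint = sum(len(t) for t in self.tasks) == len(self.local())
        steps_typed = all(s in self.states and a in self.acts() and t in self.states
                          for (s, a, t) in self.steps)
        return (bool(self.start) and self.start <= self.states and sig_disjoint
                and all(self.tasks) and tasks_cover and tasks_disjoint and steps_typed)

    def enabled(self, s, a):
        """a is enabled in s iff (s, a, s') is a step for some state s'."""
        return any(src == s and act == a for (src, act, _) in self.steps)

    def task_enabled(self, s, task):
        """A task is enabled in s iff some action of the task is enabled in s."""
        return any(self.enabled(s, a) for a in task)


# Toy automaton: after an "init" input it may output "decide".
toy = IOAutomaton(
    states=frozenset({0, 1}), start=frozenset({0}),
    inputs=frozenset({"init"}), outputs=frozenset({"decide"}), internals=frozenset(),
    tasks=(frozenset({"decide"}),),
    steps=frozenset({(0, "init", 1), (1, "init", 1), (1, "decide", 1)}),
)
```

In the toy automaton, the single task is enabled in state 1 but not in the start state 0.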

An execution fragment of A is an alternating sequence of states and actions s_0 a_1 s_1 . . . s_{i−1} a_i s_i . . .
such that for all i > 0, (s_{i−1}, a_i, s_i) ∈ steps(A), i.e., the sequence conforms to the transition relation of
A. An execution of A is an execution fragment that begins with a state in start(A).

If α is a finite execution or execution fragment, then first(α) denotes the first state of α, and
last(α) denotes the last state of α. If α is a finite execution or execution fragment, α' is an execution
fragment, and last(α) = first(α'), then α ⌢ α' denotes the concatenation of α and α'.
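
For finite sequences, these definitions translate directly into code. The sketch below is a hedged illustration: it represents a fragment as a Python list alternating states and actions, covers only the finite case, and uses hypothetical names (`is_execution_fragment`, `is_execution`, `concat`) and a hypothetical toy transition relation `STEPS`.

```python
def is_execution_fragment(seq, steps):
    """seq = [s0, a1, s1, ..., an, sn]; every (s_{i-1}, a_i, s_i) must be a step."""
    if len(seq) % 2 == 0:  # must begin and end with a state
        return False
    return all((seq[i - 1], seq[i], seq[i + 1]) in steps
               for i in range(1, len(seq), 2))


def is_execution(seq, steps, start):
    """An execution is a fragment that begins in a start state."""
    return is_execution_fragment(seq, steps) and seq[0] in start


def concat(alpha, alpha_prime):
    """Concatenate two fragments; the shared state appears only once."""
    if alpha[-1] != alpha_prime[0]:
        raise ValueError("last(alpha) must equal first(alpha')")
    return alpha + alpha_prime[1:]


# Toy transition relation (an assumption for this example only).
STEPS = {(0, "init", 1), (1, "init", 1), (1, "decide", 1)}
alpha = [0, "init", 1]
alpha_prime = [1, "decide", 1]
joined = concat(alpha, alpha_prime)
```

Here `joined` is an execution when 0 is a start state, whereas `alpha_prime` alone is only an execution fragment.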